{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "import os\n",
    "if not any(path.endswith('textbook') for path in sys.path):\n",
    "    sys.path.append(os.path.abspath('../../..'))\n",
    "from textbook_utils import *"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "co2 = pd.read_csv('data/co2_mm_mlo.txt', header=None, skiprows=72, \n",
    "                  sep='\\s+',\n",
    "                  names=['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'days'])\n",
    "co2_NA = co2.replace(-99.99, np.NaN)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(ch:wrangling_structure)=\n",
    "# Modifying Structure\n",
    "\n",
    "If a dataframe has an inconvenient structure, it can be difficult to do the\n",
    "analysis that we want. The wrangling process often reshapes the dataframe in\n",
    "some way to make the analysis easier and more natural. These changes can simply\n",
    "take a subset of the rows and/or columns from the table or change the table's\n",
    "granularity in a more fundamental way. In this section, we use the techniques from\n",
    "{numref}`Chapter %s <ch:pandas>` to show how to modify structure in the following ways:"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Simplify the structure\n",
    ": If a dataframe has features that are not needed\n",
    "in our analysis, then we may want to drop these extraneous columns to make\n",
    "handling the dataframe easier. Or if we want to focus on a particular period\n",
    "of time or geographic area, we may want to take a subset of the rows\n",
    "(subsetting is covered in {numref}`Chapter %s <ch:pandas>`). \n",
    "In {numref}`Chapter %s <ch:files>`, we'll read into\n",
    "our dataframe a small set of features from the hundreds available in the DAWN\n",
    "survey because we were interested in understanding the patterns of type of ER\n",
    "visit by demographics of the patient. In\n",
    "{numref}`Chapter %s <ch:eda>`, we'll restrict an analysis of\n",
    "home sale prices to one year and\n",
    "a few cities in an effort to reduce the impact of inflation and to better study\n",
    "the effect of location on sale price.  "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Adjust the granularity\n",
    ": In an earlier example in this chapter, \n",
    "CO<sub>2</sub> measurements were aggregated from monthly averages to yearly averages in order to \n",
    "better visualize annual trends. In the next section, we provide another example\n",
    "where we aggregate violation-level data to the inspection level so that it can\n",
    "be combined with the restaurant inspection scores. In both of these examples, we adjust\n",
    "the granularity of the dataframe to work with a coarser granularity\n",
    "by grouping together records and aggregating values.\n",
    "With the CO<sub>2</sub> measurements, we grouped the monthly values from the same year and then\n",
    "averaged them. Other common aggregations of a group are the number of records,\n",
    "sum, minimum, maximum, and first or last value in the group. The details\n",
    "of adjusting granularity of `pandas` dataframes can be found in\n",
    "{numref}`Chapter %s <ch:pandas>`, including how to group by multiple column values."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Address mixed granularity \n",
    ": At times, a dataset might have mixed granularity, where records are at different\n",
    "levels of detail. A common case is in data provided by government agencies\n",
    "where data at the county and state levels are included in the same file. When\n",
    "this happens, we usually want to split the dataframe into two, one at the\n",
    "county level and the other at the state level. This makes county-level and\n",
    "state-level analyses much easier, even feasible, to perform.\n",
    "\n",
    "Reshape the structure\n",
    ": Data, especially from government sources, can be shared as pivot\n",
    "tables. These *wide* tables have data values as column names and are often\n",
    "difficult to use in analysis. We may need to reshape them into a *long*\n",
    "form. {numref}`Figure %s <wide-vs-long>` depicts the same data stored in both\n",
    "wide and long data tables.  Each row of the wide data table maps to three rows in the long data table, as highlighted in the tables. Notice that in the wide data table, each row has three values, one for each month. In the long data table, each row only has a value for one month. Long data tables are generally easier to aggregate for future analysis. Because of this, long-form data is also frequently called [*tidy data*](https://doi.org/10.18637/jss.v059.i10)."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{figure} figures/wide-vs-long.svg\n",
    "---\n",
    "name: wide-vs-long\n",
    "---\n",
    "An example of a wide data table (top) and a long data table (bottom) containing the same data\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To demonstrate reshaping, we can put the CO<sub>2</sub> data into a wide dataframe that is like a pivot table in shape. There is a column for each month and a row for each year:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>Mo</th>\n",
       "      <th>Yr</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>...</th>\n",
       "      <th>8</th>\n",
       "      <th>9</th>\n",
       "      <th>10</th>\n",
       "      <th>11</th>\n",
       "      <th>12</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1959</td>\n",
       "      <td>315.62</td>\n",
       "      <td>316.38</td>\n",
       "      <td>316.71</td>\n",
       "      <td>317.72</td>\n",
       "      <td>...</td>\n",
       "      <td>314.80</td>\n",
       "      <td>313.84</td>\n",
       "      <td>313.26</td>\n",
       "      <td>314.8</td>\n",
       "      <td>315.58</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1960</td>\n",
       "      <td>316.43</td>\n",
       "      <td>316.97</td>\n",
       "      <td>317.58</td>\n",
       "      <td>319.02</td>\n",
       "      <td>...</td>\n",
       "      <td>315.91</td>\n",
       "      <td>314.16</td>\n",
       "      <td>313.83</td>\n",
       "      <td>315.0</td>\n",
       "      <td>316.19</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2 rows × 13 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "Mo    Yr       1       2       3       4  ...       8       9      10     11  \\\n",
       "0   1959  315.62  316.38  316.71  317.72  ...  314.80  313.84  313.26  314.8   \n",
       "1   1960  316.43  316.97  317.58  319.02  ...  315.91  314.16  313.83  315.0   \n",
       "\n",
       "Mo      12  \n",
       "0   315.58  \n",
       "1   316.19  \n",
       "\n",
       "[2 rows x 13 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "co2_pivot = pd.pivot_table(\n",
    "    co2[10:34],\n",
    "    index='Yr',   # Column to turn into new index\n",
    "    columns='Mo',  # Column to turn into new columns\n",
    "    values='Avg') # Column to aggregate \n",
    "\n",
    "co2_wide = co2_pivot.reset_index()\n",
    "\n",
    "display_df(co2_wide, cols=10)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The column headings are months, and the cell values in the grid are the CO<sub>2</sub> monthly averages. We can turn this dataframe back into a long, aka *tall*, dataframe, where the column names become a feature, called `month`, and the values in the grid are reorganized into a second feature, called `average`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Yr</th>\n",
       "      <th>month</th>\n",
       "      <th>average</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1959</td>\n",
       "      <td>1</td>\n",
       "      <td>315.62</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1960</td>\n",
       "      <td>1</td>\n",
       "      <td>316.43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>1959</td>\n",
       "      <td>12</td>\n",
       "      <td>315.58</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>1960</td>\n",
       "      <td>12</td>\n",
       "      <td>316.19</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>24 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      Yr month  average\n",
       "0   1959     1   315.62\n",
       "1   1960     1   316.43\n",
       "..   ...   ...      ...\n",
       "22  1959    12   315.58\n",
       "23  1960    12   316.19\n",
       "\n",
       "[24 rows x 3 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "co2_long = co2_wide.melt(id_vars=['Yr'],\n",
    "                         var_name='month',\n",
    "                         value_name='average')\n",
    "\n",
    "display_df(co2_long, rows=4)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that the data has been recaptured in its original shape (although the\n",
    "rows are not in their original order). Wide-form data is more common when we\n",
    "expect readers to look at the data table itself, like in an economics article\n",
    "or news story. But long-form data is more useful for data analysis. For\n",
    "instance, `co2_long` lets us write short `pandas` code to group by either year\n",
    "or month, while the wide-form data makes it difficult to group by year.\n",
    "The `.melt()` method is particularly useful for converting wide-form into long-form data. "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These structural modifications have focused on a single table. However, we often want to combine information that is spread across multiple tables. In the next section, we combine the techniques introduced in this chapter to wrangle the restaurant inspection data and address joining tables."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
